Abstract: Domain adaptation is one of the most challenging tasks of modern data analytics. If the adaptation is done correctly, models built on a specific data representation become more robust when confronted with data depicting the same classes, but described by another observation system. Among the many strategies proposed, finding domain-invariant representations has shown excellent properties, in particular because it allows a single classifier to be trained that is effective in all domains. In this paper, we propose a regularized unsupervised optimal transportation model to perform the alignment of the representations in the source and target domains. We learn a transportation plan matching both PDFs, which constrains labeled samples of the same class in the source domain to remain close during transport. In this way, we exploit at the same time the labeled samples in the source domain and the distributions observed in both domains. Experiments on toy and challenging real visual adaptation examples show the interest of the method, which consistently outperforms state-of-the-art approaches. In addition, numerical experiments show that our approach leads to better performance on domain-invariant deep learning features and can be easily adapted to the semi-supervised case where a few labeled samples are available in the target domain.

Index Terms: Unsupervised Domain Adaptation, Optimal Transport, Transfer Learning, Visual Adaptation, Classification.

1 Introduction

Modern data analytics are based on the availability of large volumes of data, sensed by a variety of acquisition devices and at high temporal frequency. But these large amounts of heterogeneous data also make the task of learning semantic concepts more difficult, since the data used for learning a decision function and those used for inference tend not to follow the same distribution. Discrepancies (also known as drift) in data distribution are due to several reasons and are application-dependent. In computer vision, this problem is known as the visual domain adaptation problem, where domain drifts occur when changing lighting conditions or acquisition devices, or when considering the presence or absence of backgrounds. In speech processing, learning from one speaker and trying to deploy an application targeted to a wide public may also be hindered by differences in background noise, tone or gender of the speaker. In remote sensing image analysis, one would like to leverage labels defined over one city image to classify the land occupation of another city. The drifts observed in the probability density function (PDF) of remote sensing images are caused by a variety of factors: different corrections for atmospheric scattering, daylight conditions at the hour of acquisition or even slight changes in the chemical composition of the materials.

For those reasons, several works have coped with these drift problems by developing learning methods able to transfer knowledge from a source domain to a target domain for which data have different PDFs. Learning in this PDF discrepancy context is denoted as the domain adaptation problem [37]. In this work, we address the most difficult variant of this problem, denoted as unsupervised domain adaptation, where data labels are only available in the source domain. We tackle this problem by assuming that the effects of the drifts can be reduced if data undergo a phase of adaptation (typically, a non-linear mapping) after which both domains look more alike.

Several theoretical works [2], [36], [22] have emphasized the role played by the divergence between the data probability distribution functions of the domains. These works have led to a principled way of solving the domain adaptation problem: transform data so as to make their distributions "closer", and use the label information available in the source domain to learn a classifier in the transformed domain, which can be applied to the target domain. Our work follows the same intuition and proposes a transformation of the source data that fits a least effort principle, i.e. an effect that is minimal with respect to a transformation cost or metric. In this sense, the adaptation problem boils down to: i) finding a transformation of the input data matching the source and target distributions and then ii) learning a new classifier from the transformed source samples. This process is depicted in Figure 1. In this paper, we advocate a solution for finding this transformation based on optimal transport.

Optimal Transport (OT) problems have recently raised interest in several fields, in particular because OT theory can be used for computing distances between probability distributions. Those distances, known under several names in the literature (Wasserstein, Monge-Kantorovich or Earth Mover distances), have important properties: i) they can be evaluated directly on empirical estimates of the distributions without having to smooth them using non-parametric or semi-parametric approaches; ii) by exploiting the geometry of the underlying metric space, they provide meaningful distances even when the supports of the distributions do not overlap. Leveraging these properties, we introduce a novel framework for unsupervised domain adaptation, which consists in learning an optimal transportation based on empirical observations. In addition, we propose several regularization terms that favor learning of better transformations w.r.t. the adaptation problem. They can either encode class information contained in the source domain or promote the preservation of neighborhood structures. An efficient algorithm is proposed for solving the resulting regularized optimal transport optimization problem. Finally, this framework can also easily be extended to the semi-supervised case, where a few labels are available in the target domain, by a simple and elegant modification of the optimal transport optimization problem.

Fig. 1: Illustration of the proposed approach for domain adaptation. (left) Dataset for training, i.e. source domain, and testing, i.e. target domain. Note that a classifier estimated on the training examples clearly does not fit the target data. (middle) A data-dependent transportation map $T_{\gamma_0}$ is estimated and used to transport the training samples onto the target domain. Note that this transformation is usually not linear. (right) The transported labeled samples are used for estimating a classifier in the target domain. Panel titles: Dataset; Optimal transport; Classification on transported samples.

The remainder of this section presents related works, while Section 2 formalizes the problem of unsupervised domain adaptation and discusses the use of optimal transport for its resolution. Section 3 introduces optimal transport and its regularized version. Section 4 presents the proposed regularization terms tailored to fit the domain adaptation constraints. Section 5 discusses algorithms for solving the regularized optimal transport problem efficiently. Section 6 evaluates the relevance of our domain adaptation framework through both synthetic and real-world examples.

1.1 Related works 

Domain adaptation. Domain adaptation strategies can be roughly divided into two families, depending on whether they assume the presence of a few labels in the target domain (semi-supervised DA) or not (unsupervised DA).


In the first family, proposed methods include searching for projections that are discriminative in both domains by using inner products between source samples and transformed target samples [42], [32], [29]. Learning projections for which labeled samples of the target domain fall on the correct side of a large margin classifier trained on the source data has also been proposed [27]. Several works based on the extraction of common features under pairwise constraints have also been introduced as domain adaptation strategies [26], [52], [47].

The second family tackles the domain adaptation problem assuming, as in this paper, that no labels are available in the target domain. Besides works dealing with sample reweighting [46], many works have considered finding a common feature representation for the two (or more) domains. Since the representation, or latent space, is common to all domains, projected labeled samples from the source domain can be used to train a classifier that is general [18], [38]. A common strategy is to propose methods that aim at finding representations in which domains match in some sense. For instance, adaptation can be performed by matching the means of the domains in the feature space [38], aligning the domains by their correlations [33] or by using pairwise constraints [51]. In most of these works, feature extraction is the key tool for finding a common latent space that embeds discriminative information shared by all domains.

Recently, the unsupervised domain adaptation problem has been revisited by considering strategies based on a gradual alignment of a feature representation. In [24], the authors start from the hypothesis that domain adaptation can be better estimated when comparing gradual distortions. Therefore, they use intermediary projections of both domains along the Grassmannian geodesic connecting the source and target eigenvectors. In [23], [54], all sets of transformed intermediary domains are obtained by using a geodesic-flow kernel. While these methods have the advantage of providing easily computable out-of-sample extensions (by projecting unseen samples onto the latent space eigenvectors), the transformation defined remains global and is applied in the same way to the whole target domain. An approach combining sample reweighting logic with representation transfer is found in [53], where the authors extend sample reweighting to a reproducing kernel Hilbert space through the use of surrogate kernels. The transformation achieved is again a global linear transformation that helps in aligning domains.

Our proposition strongly differs from those reviewed above, as it defines a local transformation for each sample in the source domain. In this sense, the domain adaptation problem can be seen as a graph matching problem [35], [10], [11], as each source sample has to be mapped onto target samples under the constraint of marginal distribution preservation.
Optimal Transport and Machine Learning. The optimal transport problem was first introduced by the French mathematician Gaspard Monge at the end of the 18th century as a way to find a minimal-effort solution to the transport of a given mass of dirt into a given hole. The problem reappeared in the middle of the 20th century in the work of Kantorovich [30] and has recently found surprising new developments as a versatile tool for several fundamental problems [49]. It has been applied in a wide range of fields, including computational fluid mechanics [3], color transfer between multiple images or morphing in the context of image processing [40], [20], [5], interpolation schemes in computer graphics [6], and economics, via matching and equilibrium problems [12].

Despite the appealing properties and application success stories, the machine learning community has considered optimal transport only recently (see, for instance, works considering the computation of distances between histograms [15] or label propagation in graphs [45]), the main reason being the high computational cost induced by the computation of the optimal transportation plan. However, new computing strategies have emerged [15], [17], [5] and made possible the application of OT distances in operational settings.

2 Optimal transport and application to domain adaptation

In this section, we present the general unsupervised 
domain adaptation problem and show how it can be 
addressed from an optimal transport perspective. 

2.1 Problem and theoretical motivations 

Let $\Omega \subseteq \mathbb{R}^d$ be an input measurable space of dimension $d$ and $\mathcal{C}$ the set of possible labels. $\mathcal{P}(\Omega)$ denotes the set of all probability measures over $\Omega$. The standard learning paradigm assumes the existence of a set of training data $X_s = \{x_i^s\}_{i=1}^{n_s}$ associated with a set of class labels $Y_s = \{y_i^s\}_{i=1}^{n_s}$, with $y_i^s \in \mathcal{C}$, and a testing set $X_t = \{x_i^t\}_{i=1}^{n_t}$ with unknown labels. In order to infer the set of labels $Y_t$ associated with $X_t$, one usually relies on an empirical estimate of the joint probability distribution $P(x, y) \in \mathcal{P}(\Omega \times \mathcal{C})$ from $(X_s, Y_s)$, and assumes that $X_s$ and $X_t$ are drawn from the same distribution $P(x) \in \mathcal{P}(\Omega)$.

2.2 Domain adaptation as a transportation problem

In domain adaptation problems, one assumes the existence of two distinct joint probability distributions $P_s(x^s, y)$ and $P_t(x^t, y)$, respectively related to a source and a target domain, noted as $\Omega_s$ and $\Omega_t$. In the following, $\mu_s$ and $\mu_t$ are their respective marginal distributions over $X$. We also denote by $f_s$ and $f_t$ the true labeling functions, i.e. the Bayes decision functions in each domain.

At least one of the two following assumptions is 
generally made by most domain adaptation methods: 

• Class imbalance: Label distributions are different in the two domains ($P_s(y) \neq P_t(y)$), but the conditional distributions of the samples with respect to the labels are the same ($P_s(x^s|y) = P_t(x^t|y)$);

• Covariate shift: Conditional distributions of the labels with respect to the data are equal ($P_s(y|x^s) = P_t(y|x^t)$, or equivalently $f_s = f_t = f$). However, data distributions in the two domains are supposed to be different ($P_s(x^s) \neq P_t(x^t)$). For the adaptation techniques to be effective, this difference needs to be small [2].

In real world applications, the drift occurring between 
the source and the target domains generally implies a 
change in both marginal and conditional distributions. 

In our work, we assume that the domain drift is due to an unknown, possibly nonlinear transformation of the input space $T : \Omega_s \to \Omega_t$. This transformation may have a physical interpretation (e.g. change in the acquisition conditions, sensor drifts, thermal noise, etc.). It can also be directly caused by the unknown process that generates the data. Additionally, we also suppose that the transformation preserves the conditional distribution, i.e.

$$P_s(y|x^s) = P_t(y|T(x^s)).$$

This means that the label information is preserved by the transformation, and the Bayes decision functions are tied through the equation $f_t(T(x)) = f_s(x)$.

Another insight can be provided regarding the transformation $T$. From a probabilistic point of view, $T$ transforms the measure $\mu$ into its image measure, noted $T\#\mu$, which is another probability measure over $\Omega_t$ satisfying

$$T\#\mu(x) = \mu(T^{-1}(x)), \quad \forall x \in \Omega_t. \tag{1}$$

$T$ is said to be a transport map or push-forward from $\mu_s$ to $\mu_t$ if $T\#\mu_s = \mu_t$ (as illustrated in Figure 2.a). Under this assumption, $X_t$ are drawn from the same PDF as $T\#\mu_s$. This provides a principled way to solve the adaptation problem:

1) Estimate $\mu_s$ and $\mu_t$ from $X_s$ and $X_t$ (Equation (6))

2) Find a transport map $T$ from $\mu_s$ to $\mu_t$

3) Use $T$ to transport labeled samples $X_s$ and train a classifier from them.

Searching for $T$ in the space of all possible transformations is intractable, and some restrictions need to be imposed. Here, we propose that $T$ should be chosen so as to minimize a transportation cost $C(T)$ expressed as:

$$C(T) = \int_{\Omega_s} c(x, T(x))\, d\mu(x), \tag{2}$$

where the cost function $c : \Omega_s \times \Omega_t \to \mathbb{R}^+$ is a distance function over the metric space $\Omega$. $C(T)$ can be interpreted as the energy required to move a probability mass $\mu(x)$ from $x$ to $T(x)$.

The problem of finding such a transportation of minimal cost has already been investigated in the literature. For instance, the optimal transportation problem as defined by Monge is the solution of the following minimization problem:

$$T_0 = \operatorname*{argmin}_{T} \int_{\Omega_s} c(x, T(x))\, d\mu(x), \quad \text{s.t. } T\#\mu_s = \mu_t. \tag{3}$$

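The Kantorovich formulation of the optimal transportation [30] is a convex relaxation of the above Monge problem. Indeed, let us define $\Pi$ as the set of all probabilistic couplings $\in \mathcal{P}(\Omega_s \times \Omega_t)$ with marginals $\mu_s$ and $\mu_t$. The Kantorovich problem seeks a general coupling $\gamma \in \Pi$ between $\Omega_s$ and $\Omega_t$:

$$\gamma_0 = \operatorname*{argmin}_{\gamma \in \Pi} \int_{\Omega_s \times \Omega_t} c(x^s, x^t)\, d\gamma(x^s, x^t). \tag{4}$$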

In this formulation, $\gamma$ can be understood as a joint probability measure with marginals $\mu_s$ and $\mu_t$, as depicted in Figure 2.b. $\gamma_0$ is also known as the transportation plan [43]. It allows the Wasserstein distance of order $p$ between $\mu_s$ and $\mu_t$ to be defined. This distance is formalized as

$$W_p(\mu_s, \mu_t) = \left( \inf_{\gamma \in \Pi} \int_{\Omega_s \times \Omega_t} d(x^s, x^t)^p\, d\gamma(x^s, x^t) \right)^{1/p} = \inf_{\gamma \in \Pi} \left\{ \left( \mathbb{E}_{(x^s, x^t) \sim \gamma}\, d(x^s, x^t)^p \right)^{1/p} \right\}, \tag{5}$$

where $d$ is a distance and the corresponding cost function is $c(x^s, x^t) = d(x^s, x^t)^p$. The Wasserstein distance is also known as the Earth Mover Distance in the computer vision community [41] and it defines a metric over the space of integrable squared probability measures.
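As an illustration of Equation (5), the following minimal sketch computes $W_p$ in the simplest setting we are aware of: one-dimensional samples, equal sample counts and uniform weights, where the optimal coupling is known to pair the sorted samples. The function name `wasserstein_1d` and the toy data are assumptions made for this example only; the sketch is not the solver used in the paper.

```python
import numpy as np

def wasserstein_1d(xs, xt, p=2):
    """W_p between two 1-D empirical distributions with the same number of
    uniformly weighted samples. In this special case the optimal coupling
    matches the i-th smallest source sample to the i-th smallest target one."""
    xs = np.sort(np.asarray(xs, dtype=float))
    xt = np.sort(np.asarray(xt, dtype=float))
    if xs.shape != xt.shape:
        raise ValueError("this sketch assumes equal sample counts")
    return float(np.mean(np.abs(xs - xt) ** p) ** (1.0 / p))

# Toy data (hypothetical): the target is the source shifted by 5,
# so the two supports do not overlap at all.
source = [0.0, 1.0, 2.0]
target = [5.0, 7.0, 6.0]
w2 = wasserstein_1d(source, target, p=2)
print(w2)
```

Even though the supports are disjoint, the distance remains informative: it reflects the size of the shift between the two sample sets.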


In the remainder, we consider the squared $\ell_2$ Euclidean distance as a cost function, $c(x, y) = \|x - y\|_2^2$, for computing optimal transportation. As a consequence, we evaluate distances between measures according to the squared Wasserstein distance $W_2^2$ associated with the Euclidean distance $d(x, y) = \|x - y\|_2$. The main rationale for this choice is that it experimentally provided the best result on average (as shown in the supplementary material). Nevertheless, other cost functions better suited to the nature of specific data can be considered, depending on the application at hand and the data representation, as discussed in more detail in Section 3.4.

3 Regularized discrete optimal transport

This section discusses the problem of optimal transport for domain adaptation. In the first part, we introduce the OT optimization problem on discrete empirical distributions. Then, we discuss a regularized variant of this discrete optimal transport problem. Finally, we address the question of how the resulting probabilistic coupling can be used for mapping samples from the source to the target domain.

3.1 Discrete optimal transport 

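When $\mu_s$ and $\mu_t$ are only accessible through discrete samples, the corresponding empirical distributions can be written as

$$\mu_s = \sum_{i=1}^{n_s} p_i^s \delta_{x_i^s}, \qquad \mu_t = \sum_{i=1}^{n_t} p_i^t \delta_{x_i^t}, \tag{6}$$

where $\delta_{x_i}$ is the Dirac function at location $x_i \in \mathbb{R}^d$. $p_i^s$ and $p_i^t$ are probability masses associated to the $i$-th sample and belong to the probability simplex, i.e. $\sum_{i=1}^{n_s} p_i^s = \sum_{i=1}^{n_t} p_i^t = 1$. It is straightforward to adapt the Kantorovich formulation of the optimal transport problem to the discrete case. We denote by $\mathcal{B}$ the set of probabilistic couplings between the two empirical distributions, defined as:

$$\mathcal{B} = \left\{ \gamma \in (\mathbb{R}^+)^{n_s \times n_t} \mid \gamma 1_{n_t} = \mu_s,\ \gamma^T 1_{n_s} = \mu_t \right\}, \tag{7}$$

where $1_d$ is a $d$-dimensional vector of ones and, with a slight abuse of notation, $\mu_s$ and $\mu_t$ denote the vectors of probability masses. The Kantorovich formulation of the optimal transport [30] reads:

$$\gamma_0 = \operatorname*{argmin}_{\gamma \in \mathcal{B}} \langle \gamma, C \rangle_F, \tag{8}$$

where $\langle \cdot, \cdot \rangle_F$ is the Frobenius dot product and $C \geq 0$ is the cost function matrix, whose term $C(i, j) = c(x_i^s, x_j^t)$ denotes the cost to move a probability mass from $x_i^s$ to $x_j^t$. As previously detailed, this cost was chosen as the squared Euclidean distance between the two locations, i.e. $C(i, j) = \|x_i^s - x_j^t\|_2^2$.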
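The objects entering Equations (7) and (8) can be made concrete with a short sketch. The helper names `squared_euclidean_cost` and `is_coupling`, the tolerance and the toy samples are assumptions of this example; the independent coupling used here is only a feasible point of $\mathcal{B}$, not the optimal plan.

```python
import numpy as np

def squared_euclidean_cost(Xs, Xt):
    """Cost matrix C(i, j) = ||x_i^s - x_j^t||_2^2, built by broadcasting."""
    diff = Xs[:, None, :] - Xt[None, :, :]      # shape (n_s, n_t, d)
    return np.sum(diff ** 2, axis=2)

def is_coupling(gamma, ps, pt, tol=1e-9):
    """Check the constraints defining the set B: non-negative entries,
    row sums equal to the source masses, column sums equal to the target masses."""
    return bool(
        np.all(gamma >= -tol)
        and np.allclose(gamma.sum(axis=1), ps, atol=tol)
        and np.allclose(gamma.sum(axis=0), pt, atol=tol)
    )

# Hypothetical toy samples: 2 source points and 3 target points in 2-D.
Xs = np.array([[0.0, 0.0], [1.0, 0.0]])
Xt = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
ps = np.full(2, 1.0 / 2)
pt = np.full(3, 1.0 / 3)

C = squared_euclidean_cost(Xs, Xt)
gamma_indep = np.outer(ps, pt)                   # feasible, generally not optimal
cost_indep = float(np.sum(gamma_indep * C))      # Frobenius product <gamma, C>_F
print(C, is_coupling(gamma_indep, ps, pt), cost_indep)
```

Any candidate plan returned by a solver can be validated with `is_coupling` before it is used for mapping samples.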

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. X, NO. X, JANUARY XX

Note that when $n_s = n_t = n$ and $\forall i, j$, $p_i^s = p_j^t = 1/n$, $\gamma_0$ is simply a permutation matrix (scaled by $1/n$). In this case, the optimal transport problem boils down to an optimal assignment problem. In the general case, it can be shown that $\gamma_0$ is a sparse matrix with at most $n_s + n_t - 1$ non-zero entries, equating the rank of the constraint matrix expressing the two marginal constraints.
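The assignment special case can be checked by exhaustive search on a tiny problem. The sketch below is only meant to illustrate the structure of $\gamma_0$; the function name `optimal_assignment` and the data are assumptions of this example, and the factorial cost of the enumeration makes it unusable beyond a handful of points.

```python
import itertools
import numpy as np

def optimal_assignment(C):
    """Brute-force solution of Problem (8) when n_s = n_t = n and all masses
    equal 1/n: enumerate every permutation and keep the cheapest one."""
    n = C.shape[0]
    best_perm, best_cost = None, np.inf
    for perm in itertools.permutations(range(n)):
        cost = sum(C[i, perm[i]] for i in range(n)) / n
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    gamma = np.zeros((n, n))
    gamma[np.arange(n), list(best_perm)] = 1.0 / n   # scaled permutation matrix
    return gamma, float(best_cost)

# Hypothetical 1-D toy data with squared Euclidean cost.
Xs = np.array([[0.0], [1.0], [2.0]])
Xt = np.array([[2.5], [0.5], [1.5]])
C = (Xs - Xt.T) ** 2
gamma0, cost0 = optimal_assignment(C)
print(gamma0, cost0)
```

The returned plan has exactly $n$ non-zero entries, each equal to $1/n$, which is within the $n_s + n_t - 1$ bound stated above.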

Problem (8) is a linear program and can be solved with combinatorial algorithms such as the simplex method and its network variants (successive shortest path algorithms, Hungarian or relaxation algorithms). Yet, the computational complexity was shown to be $O((n_s + n_t) n_s n_t \log(n_s + n_t))$ [1, p. 472, Th. 12.2] at best, which limits the utility of the method when handling large datasets. However, the regularization scheme recently proposed by Cuturi [15], presented in the next section, allows a very fast computation of a transportation plan.

3.2 Regularized optimal transport 

Regularization is a classical approach used for preventing overfitting when few samples are available for learning. It can also be used for inducing some properties on the solution. In the following, we discuss a regularization term recently introduced for the optimal transport problem.

Cuturi [15] proposed to regularize the expression of the optimal transport problem by the entropy of the probabilistic coupling. The resulting information-theoretic regularized version of the transport $\gamma_0$ is the solution of the minimization problem:

$$\gamma_0 = \operatorname*{argmin}_{\gamma \in \mathcal{B}} \langle \gamma, C \rangle_F + \lambda \Omega_s(\gamma), \tag{9}$$

where $\Omega_s(\gamma) = \sum_{i,j} \gamma(i, j) \log \gamma(i, j)$ computes the negentropy of $\gamma$. The intuition behind this form of regularization is the following: since most elements of $\gamma_0$ should be zero with high probability, one can look for a smoother version of the transport, thus lowering its sparsity, by increasing its entropy. As a result, the optimal transport $\gamma_0$ will have a denser coupling between the distributions. $\Omega_s(\cdot)$ can also be interpreted as a Kullback-Leibler divergence $KL(\gamma \| \gamma_u)$ between the joint probability $\gamma$ and a uniform joint probability $\gamma_u(i, j) = \frac{1}{n_s n_t}$. Indeed, by expanding this KL divergence, we have $KL(\gamma \| \gamma_u) = \log(n_s n_t) + \sum_{i,j} \gamma(i, j) \log \gamma(i, j)$. The first term is a constant w.r.t. $\gamma$, which means that we can equivalently use $KL(\gamma \| \gamma_u)$ or $\Omega_s(\gamma) = \sum_{i,j} \gamma(i, j) \log \gamma(i, j)$ in Equation (9).
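For completeness, the expansion of the KL divergence can be written out step by step. The only fact used is that the entries of a coupling sum to one, $\sum_{i,j} \gamma(i, j) = 1$:

$$KL(\gamma \| \gamma_u) = \sum_{i,j} \gamma(i, j) \log \frac{\gamma(i, j)}{1/(n_s n_t)} = \sum_{i,j} \gamma(i, j) \log \gamma(i, j) + \log(n_s n_t) \sum_{i,j} \gamma(i, j) = \Omega_s(\gamma) + \log(n_s n_t).$$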

Hence, as the parameter $\lambda$ weighting the entropy-based regularization increases, the sparsity of $\gamma_0$ decreases and source points tend to distribute their probability masses toward more target points. When $\lambda$ becomes very large ($\lambda \to \infty$), the OT solution of Equation (9) converges, for uniform marginals, toward $\gamma_0(i, j) = \frac{1}{n_s n_t}$, $\forall i, j$.

Another appealing outcome of the regularized OT formulation given in Equation (9) is the derivation of a computationally efficient algorithm based on Sinkhorn-Knopp's scaling matrix approach [31]. This efficient algorithm will also be a key element in our methodology presented in Section 4.
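A minimal sketch of the Sinkhorn-Knopp scaling iterations for Equation (9) is given below. It relies on the standard fact that the entropic solution has the form $\mathrm{diag}(u)\, K\, \mathrm{diag}(v)$ with $K = \exp(-C/\lambda)$. The function name `sinkhorn`, the stopping rule, the iteration budget and the toy cost matrix are assumptions of this example; a production implementation would typically work in the log domain to avoid underflow for small $\lambda$.

```python
import numpy as np

def sinkhorn(ps, pt, C, lam, n_iter=1000, tol=1e-10):
    """Entropy-regularized OT by alternating row/column rescaling.
    ps, pt: source/target mass vectors; C: cost matrix; lam: weight of the
    entropic term (larger lam gives a denser, smoother coupling)."""
    K = np.exp(-C / lam)                 # Gibbs kernel
    u = np.ones_like(ps)
    v = np.ones_like(pt)
    for _ in range(n_iter):
        u = ps / (K @ v)                 # enforce the row marginals
        v = pt / (K.T @ u)               # enforce the column marginals
        if np.abs(u * (K @ v) - ps).max() < tol:
            break
    return u[:, None] * K * v[None, :]

# Hypothetical toy problem: 3 source and 3 target points, uniform masses.
ps = np.full(3, 1.0 / 3)
pt = np.full(3, 1.0 / 3)
C = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 1.0],
              [4.0, 1.0, 0.0]])

gamma_sharp = sinkhorn(ps, pt, C, lam=0.1)     # close to the sparse OT plan
gamma_smooth = sinkhorn(ps, pt, C, lam=100.0)  # close to the uniform coupling
print(np.round(gamma_sharp, 3))
print(np.round(gamma_smooth, 3))
```

Comparing the two outputs shows the behaviour described above: a small $\lambda$ concentrates the mass on the cheapest pairs, while a large $\lambda$ spreads it almost uniformly over all pairs.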

3.3 OT-based mapping of the samples 

In the context of domain adaptation, once the probabilistic coupling $\gamma_0$ has been computed, source samples have to be transported in the target domain. For this purpose, one can interpolate the two distributions $\mu_s$ and $\mu_t$ by following the geodesics of the Wasserstein metric [49, Chapter 7], parameterized by $t \in [0, 1]$. This defines a new distribution $\hat{\mu}$ such that:

$$\hat{\mu} = \operatorname*{argmin}_{\mu}\ (1 - t)\, W_2(\mu_s, \mu)^2 + t\, W_2(\mu_t, \mu)^2. \tag{10}$$

Still following Villani's book, one can show that for a squared $\ell_2$ cost, this distribution boils down to:

$$\hat{\mu} = \sum_{i,j} \gamma_0(i, j)\, \delta_{(1 - t) x_i^s + t x_j^t}. \tag{11}$$

Since our goal is to transport the source samples onto the target distribution, we are mainly interested in the case $t = 1$. For this value of $t$, the novel distribution $\hat{\mu}$ is a distribution with the same support as $\mu_t$, since Equation (11) reduces to

$$\hat{\mu} = \sum_{j} \hat{p}_j^t\, \delta_{x_j^t}, \tag{12}$$

with $\hat{p}_j^t = \sum_i \gamma_0(i, j)$. The weights $\hat{p}_j^t$ can be seen as the sum of probability mass coming from all samples $\{x_i^s\}$ that is transferred to sample $x_j^t$. Alternatively, $\gamma_0(i, j)$ also tells us how much probability mass of $x_i^s$ is transferred to $x_j^t$. We can exploit this information to compute a transformation of the source samples.
This transformation can be conveniently expressed with respect to the target samples as the following barycentric mapping:

$$\hat{x}_i^s = \operatorname*{argmin}_{x \in \mathbb{R}^d} \sum_j \gamma_0(i, j)\, c(x, x_j^t), \tag{13}$$

where $x_i^s$ is a given source sample and $\hat{x}_i^s$ is its corresponding image. When the cost function is the squared $\ell_2$ distance, this barycenter corresponds to a weighted average and the sample is mapped into the convex hull of the target samples. For all source samples, this barycentric mapping can therefore be expressed as:

$$\hat{X}_s = T_{\gamma_0}(X_s) = \mathrm{diag}(\gamma_0 1_{n_t})^{-1} \gamma_0 X_t. \tag{14}$$

The inverse mapping from the target to the source domain can also be easily computed from $\gamma_0^T$. Interestingly, one can show [17, Eq. 8] that this transformation is a first order approximation of the true $n_s$ Wasserstein barycenters of the target distributions. Also note that when marginals $\mu_s$ and $\mu_t$ are uniform, one can easily derive the barycentric mapping as a linear expression:

$$\hat{X}_s = n_s \gamma_0 X_t \quad \text{and} \quad \hat{X}_t = n_t \gamma_0^T X_s \tag{15}$$

for the source and target samples. Finally, remark that if $\gamma_0(i, j) = \frac{1}{n_s n_t}$, $\forall i, j$, then each transported source point converges toward the center of mass of the target distribution, that is $\frac{1}{n_t} \sum_j x_j^t$. This occurs when $\lambda \to \infty$ in Equation (9).
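The barycentric mapping of Equations (14) and (15) is a few lines of linear algebra. In the sketch below, the function name `barycentric_map` and the toy target samples are assumptions of this example; the uniform coupling is used because its image is known in closed form (the target center of mass).

```python
import numpy as np

def barycentric_map(gamma, Xt):
    """Equation (14): each source sample is sent to the gamma-weighted
    average of the target samples, diag(gamma 1)^{-1} gamma Xt."""
    row_mass = gamma.sum(axis=1, keepdims=True)   # gamma 1_{n_t}, shape (n_s, 1)
    return (gamma @ Xt) / row_mass

# Hypothetical target samples and a uniform coupling (the large-lambda limit).
Xt = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 4.0]])
ns, nt = 2, 3
gamma_uniform = np.full((ns, nt), 1.0 / (ns * nt))

Xs_hat = barycentric_map(gamma_uniform, Xt)       # general formula, Eq. (14)
Xs_hat_linear = ns * gamma_uniform @ Xt           # uniform-marginal shortcut, Eq. (15)
print(Xs_hat)
```

Both expressions agree here, and every transported point collapses onto the mean of the target samples, which illustrates why an overly large regularization weight destroys the structure of the transported data.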

3.4 Discussing optimal transport for domain adaptation

We discuss here the requirements and conditions of 
applicability of the proposed method. 

Guarantees of recovery of the correct transformation. Our goal for achieving domain adaptation is to uncover the transformation that occurred between source and target distributions. While the family of transformations that an OT formulation can recover is wide, we provide a proof that, for some simple affine transformations of discrete distributions, our OT solution is able to match source and target examples exactly.

Theorem 3.1: Let $\mu_s$ and $\mu_t$ be two discrete distributions with $n$ Diracs as defined in Equation (6). If the following conditions hold

1) The source samples in $\mu_s$ are $x_i^s \in \mathbb{R}^d$, $\forall i \in 1, \dots, n$, such that $x_i^s \neq x_j^s$ if $i \neq j$.

2) All weights in the source and target distributions are $\frac{1}{n}$.

3) The target samples are defined as $x_i^t = A x_i^s + b$, i.e. an affine transformation of the source samples.

4) $b \in \mathbb{R}^d$ and $A \in S^+$ is a strictly positive definite matrix.

5) The cost function is $c(x^s, x^t) = \|x^s - x^t\|_2^2$.

then the solution $T_0$ of the optimal transport problem (8) is such that $T_0(x_i^s) = A x_i^s + b = x_i^t$, $\forall i \in 1, \dots, n$.

In this case, we retrieve the exact affine transformation on the discrete samples, which means that the label information is fully preserved during transportation. Therefore, one can train a classifier on the mapped samples with no generalization loss. We provide a simple proof in the supplementary material.
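Theorem 3.1 can be checked numerically on a small instance. The sketch below is an illustration, not a proof: the random seed, the sizes `n` and `d`, and the way the symmetric positive definite matrix `A` is built are assumptions of this example, and the exhaustive search over permutations is only feasible because `n` is tiny.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 2

Xs = rng.normal(size=(n, d))          # distinct source samples (condition 1)
M = rng.normal(size=(d, d))
A = M @ M.T + np.eye(d)               # symmetric, strictly positive definite (condition 4)
b = rng.normal(size=d)
Xt = Xs @ A.T + b                     # x_i^t = A x_i^s + b (condition 3)

# Squared Euclidean cost (condition 5); uniform weights 1/n (condition 2)
# make Problem (8) an assignment problem over permutations.
C = np.sum((Xs[:, None, :] - Xt[None, :, :]) ** 2, axis=2)
best_perm = min(
    itertools.permutations(range(n)),
    key=lambda perm: sum(C[i, perm[i]] for i in range(n)),
)
print(best_perm)
```

The cheapest assignment sends each source sample to its own affine image, i.e. the optimal permutation is the identity, as the theorem states.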

Choosing the cost function. In this work, we have mainly considered an $\ell_2$-based cost function. Let us now discuss the implication of using a different cost function in our framework. A number of norm-based distances have been investigated by mathematicians [49, p 972]. Other types of metrics can also be considered, such as Riemannian distances over a manifold [49, Part II], or learnt metrics [16]. Concave cost functions are also of particular use in real life problems [21]. Each different cost function will lead to a different OT plan $\gamma_0$, but the cost itself does not impact the OT optimization problem, i.e. the solver is independent from the cost function. Nonetheless, since $c(\cdot, \cdot)$ defines the Wasserstein geodesic, the interpolation between domains defined in Equation (10) leads to a different trajectory (potentially non-unique). Equation (11), which corresponds to $c(\cdot, \cdot)$ being a squared $\ell_2$ distance, does not hold anymore. Nevertheless, the solution of (10) for $t = 1$ does not depend on the cost $c$ and one can still use the proposed barycentric mapping (13). For instance, if the cost function is based on the $\ell_1$ norm, the transported samples will be estimated using a component-wise weighted median. Unfortunately, for more complex cost functions, the barycentric mapping might be complex to estimate.
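The $\ell_1$ case mentioned above can be sketched as follows. A weighted median minimizes $\sum_j w_j |x - x_j|$, so applying it independently to each coordinate solves Equation (13) for an $\ell_1$ cost. The function names `weighted_median` and `l1_barycentric_map`, as well as the toy weights (rows of a hypothetical coupling, left unnormalized since the median is scale-invariant), are assumptions of this example.

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total
    weight; this is a minimizer of sum_j w_j * |x - values_j|."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    idx = np.searchsorted(cum, 0.5 * cum[-1])
    return v[idx]

def l1_barycentric_map(gamma, Xt):
    """Barycentric mapping (13) for an l1 cost: one weighted median per
    source sample and per coordinate, with weights taken from gamma's rows."""
    ns, d = gamma.shape[0], Xt.shape[1]
    out = np.empty((ns, d))
    for i in range(ns):
        for k in range(d):
            out[i, k] = weighted_median(Xt[:, k], gamma[i])
    return out

# Hypothetical target samples and coupling weights.
Xt = np.array([[0.0, 10.0], [1.0, 20.0], [10.0, 30.0]])
gamma = np.array([[0.2, 0.5, 0.3],
                  [0.6, 0.1, 0.3]])
Xs_hat_l1 = l1_barycentric_map(gamma, Xt)
print(Xs_hat_l1)
```

Unlike the weighted average of the squared $\ell_2$ case, each transported coordinate here coincides with a coordinate of an actual target sample, which makes the mapping less sensitive to outlying target points.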

4 Class-regularization for domain adaptation

In this section we explore regularization terms that preserve label information and sample neighborhood during transportation. Finally, we discuss the semi-supervised case and show that label information in the target domain can be effectively included in the proposed model.

4.1 Regularizing the transport with class labels 

Optimal transport, as presented in the previous section, does not use any class information. However, even if our goal is unsupervised domain adaptation, class labels are available in the source domain. This information is typically used only during the decision function learning stage, which follows the adaptation step. Our proposition is to take advantage of the label information for estimating a better transport. More precisely, we aim at penalizing couplings that match source samples with different labels to the same target samples.

To this end, we propose to add a new term to the regularized optimal transport, leading to the following optimization problem:

$$\min_{\gamma \in \mathcal{B}} \langle \gamma, C \rangle_F + \lambda \Omega_s(\gamma) + \eta \Omega_c(\gamma), \tag{16}$$

where $\eta \geq 0$ and $\Omega_c(\cdot)$ is a class-based regularization term.

In this work, we propose and study two choices for this regularizer $\Omega_c(\cdot)$. The first is based on group sparsity and promotes a probabilistic coupling $\gamma_0$ where a given target sample receives masses from source samples which have the same label. The second is based on graph Laplacian regularization and promotes a locally smooth and class-regular structure in the transported source samples.

4.1.1 Regularization with group-sparsity 

With the first regularizer, our objective is to exploit label information in the optimal transport computation. We suppose that all samples in the source domain have labels. The main intuition underlying the use of this group-sparse regularizer is that we would like each target sample to receive masses only from source samples that have the same label. As a consequence, we expect that a given target sample will be involved in the representation of transported source samples as defined in Equation (14), but only for samples from the source domain of the same class. This behaviour can be induced by means of a group-sparse penalty on the columns of $\gamma$.

Fig. 2: Illustration of the optimal transport problem. (a) Monge problem over 2D domains. $T$ is a push-forward from $\Omega_s$ to $\Omega_t$. (b) Kantorovich relaxation over 1D domains: $\gamma$ can be seen as a joint probability distribution with marginals $\mu_s$ and $\mu_t$. (c) Illustration of the solution of the Kantorovich relaxation computed between two ellipsoidal distributions in 2D. The grey line between two points indicates a non-zero coupling between them.

This approach was introduced in our preliminary work [14]. In that paper, we proposed an $\ell_p - \ell_1$ regularization term with $p < 1$ (mainly for algorithmic reasons). When applying a majorization-minimization technique on the $\ell_p - \ell_1$ norm, the problem can be cast as problem (9) and can be solved using the efficient Sinkhorn-Knopp algorithm at each iteration. However, this regularization term with $p < 1$ is non-convex and thus the proposed algorithm is guaranteed to converge only to local stationary points.

In this paper, we retain the convexity of the underlying problem and use the convex group-lasso regularizer $\ell_1 - \ell_2$ instead. This regularizer is defined as

$$\Omega_c(\gamma) = \sum_j \sum_{cl} \|\gamma(I_{cl}, j)\|_2, \tag{17}$$

where $\|\cdot\|_2$ denotes the $\ell_2$ norm and $I_{cl}$ contains the indices of rows in $\gamma$ related to source domain samples of class $cl$. Hence, $\gamma(I_{cl}, j)$ is a vector containing coefficients of the $j$th column of $\gamma$ associated to class $cl$. Since the $j$th column of $\gamma$ is related to the $j$th target sample, this regularizer will induce the desired sparse representation in the target sample. Among other benefits, the convexity of the corresponding problem allows the use of an efficient generic optimization scheme, presented in Section 5.
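Evaluating the group-lasso term of Equation (17) for a given coupling is straightforward, as the following sketch shows. The function name `group_lasso_regularizer`, the label vector and the two toy couplings are assumptions of this example; both couplings have the same marginals and differ only in whether a target sample receives mass from one class or from two.

```python
import numpy as np

def group_lasso_regularizer(gamma, ys):
    """Equation (17): for every target sample j (column) and every class cl,
    add the l2 norm of the entries of column j restricted to the rows I_cl."""
    total = 0.0
    for cl in np.unique(ys):
        rows = gamma[ys == cl]                        # rows indexed by I_cl
        total += np.linalg.norm(rows, axis=0).sum()   # one l2 norm per column
    return float(total)

# Hypothetical source labels: samples 0 and 1 in class 0, samples 2 and 3 in class 1.
ys = np.array([0, 0, 1, 1])

# Each target sample receives mass from a single class.
gamma_pure = np.array([[0.25, 0.0],
                       [0.25, 0.0],
                       [0.0, 0.25],
                       [0.0, 0.25]])
# Each target sample receives mass from both classes.
gamma_mixed = np.array([[0.25, 0.0],
                        [0.0, 0.25],
                        [0.25, 0.0],
                        [0.0, 0.25]])

omega_pure = group_lasso_regularizer(gamma_pure, ys)
omega_mixed = group_lasso_regularizer(gamma_mixed, ys)
print(omega_pure, omega_mixed)
```

The class-consistent coupling gets the smaller penalty, which is the behaviour the regularizer is designed to favor.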

Ideally, with this regularizer we expect that the masses corresponding to each group of labels are matching samples of the source and target domains exclusively. Hence, for the domain adaptation problem to have a relevant solution, the distributions of labels are expected to be preserved in both the source and target distributions. We thus need to have $P_s(y) = P_t(y)$. This assumption, which is a classical assumption in the field of learning, is nevertheless a mild requirement since, in practice, small deviations of proportions do not prevent the method from working (see reference [48] for experimental results on this particular issue).

4.1.2 Laplacian regularization 

This regularization term aims at preserving the data
structure (approximated by a graph) during transport
[20], [13]. Intuitively, we would like similar samples
in the source domain to also be similar after
transportation. Hence, denote as $\hat{x}_i^s$ the transported
source sample $x_i^s$, with $\hat{x}_i^s$ being linearly dependent
on the transportation matrix $\gamma$ through Equation (14).
Now, given a positive symmetric similarity matrix $S_s$
of samples in the source domain, our regularization
term is defined as

$$\Omega_c(\gamma) = \frac{1}{n_s^2} \sum_{i,j} S_s(i,j)\, \|\hat{x}_i^s - \hat{x}_j^s\|_2^2, \tag{18}$$

where $S_s(i,j) \geq 0$ are the coefficients of matrix
$S_s \in \mathbb{R}^{n_s \times n_s}$ that encodes similarity between pairs
of source samples. In order to further preserve class
structures, we can sparsify similarities for samples of
different classes. In practice, we thus impose $S_s(i,j) = 0$
if $y_i \neq y_j$.

The above equation can be simplified when the
marginal distributions are uniform. In that case,
transported source samples can be computed according to
Equation (15). Hence, $\Omega_c(\gamma)$ boils down to

$$\Omega_c(\gamma) = \mathrm{Tr}(X_t^\top \gamma^\top L_s \gamma X_t), \tag{19}$$

where $L_s = \mathrm{diag}(S_s \mathbf{1}) - S_s$ is the Laplacian of the
graph $S_s$. The regularizer is therefore quadratic w.r.t.
$\gamma$.
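The link between the pairwise form and the trace form can be checked numerically. The sketch below assumes that, under uniform marginals, Equation (15) gives the transported samples as $n_s \gamma X_t$, and that the pairwise penalty is normalized by $1/n_s^2$. Both are assumptions about equations not reproduced legibly here, and all variable names are illustrative. Under these assumptions the two expressions agree up to a constant factor of 2, which can be absorbed into the regularization weight.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_t, d = 5, 4, 3
X_t = rng.normal(size=(n_t, d))

# A non-negative coupling whose rows each sum to 1/n_s (uniform source marginal).
gamma = rng.random((n_s, n_t))
gamma /= gamma.sum(axis=1, keepdims=True) * n_s

# Symmetric, non-negative similarity matrix and its graph Laplacian.
S = rng.random((n_s, n_s))
S = (S + S.T) / 2
L = np.diag(S.sum(axis=1)) - S

X_hat = n_s * gamma @ X_t                       # assumed form of Equation (15)
diff = X_hat[:, None, :] - X_hat[None, :, :]    # all pairwise differences
pairwise = (S * (diff ** 2).sum(-1)).sum() / n_s ** 2
trace = np.trace(X_t.T @ gamma.T @ L @ gamma @ X_t)
```

Here `pairwise` equals `2 * trace` up to floating-point error, by the standard identity $\sum_{i,j} S(i,j)\|x_i - x_j\|^2 = 2\,\mathrm{Tr}(X^\top L X)$ for symmetric $S$.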

The regularization terms (18) or (19) are defined
based on the transported source samples. When
similarity information is also available in the target
samples, for instance through a similarity matrix
$S_t$, we can take advantage of this knowledge, and a
symmetric Laplacian regularization of the form

$$\Omega_c(\gamma) = (1 - \alpha)\,\mathrm{Tr}(X_t^\top \gamma^\top L_s \gamma X_t) + \alpha\,\mathrm{Tr}(X_s^\top \gamma L_t \gamma^\top X_s) \tag{20}$$

can be used instead. In the above equation, $L_t =
\mathrm{diag}(S_t \mathbf{1}) - S_t$ is the Laplacian of the graph in the
target domain and $0 \leq \alpha \leq 1$ is a trade-off parameter
that weights the importance of each part of the
regularization term. Note that, unlike the matrix $S_s$,
the similarity matrix $S_t$ cannot be sparsified according
to the class structure, since labels are generally not
available for the target domain.
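For gradient-based solvers, the derivative of the symmetric Laplacian term is needed. As a worked step, assuming the regularizer has the form $(1-\alpha)\,\mathrm{Tr}(X_t^\top \gamma^\top L_s \gamma X_t) + \alpha\,\mathrm{Tr}(X_s^\top \gamma L_t \gamma^\top X_s)$ and using the symmetry of $L_s$, $L_t$, $X_t X_t^\top$ and $X_s X_s^\top$, the standard matrix-calculus rule $\nabla_\gamma \mathrm{Tr}(\gamma^\top A \gamma B) = A \gamma B + A^\top \gamma B^\top$ gives:

```latex
\nabla_\gamma \Omega_c(\gamma)
  = 2(1-\alpha)\, L_s \,\gamma\, X_t X_t^\top
  + 2\alpha\, X_s X_s^\top \,\gamma\, L_t .
```

Both terms are $n_s \times n_t$ matrices, matching the shape of $\gamma$, and the gradient is linear in $\gamma$, as expected for a quadratic regularizer.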

A regularization term similar to $\Omega_c(\gamma)$ has been
proposed in [20] for histogram adaptation between
images. However, the authors focused on displacements
$(\hat{x}_i^s - x_i^s)$ instead of on preserving the class
structure of the transported samples.

4.2 Regularizing for semi-supervised domain 
adaptation 

In semi-supervised domain adaptation, a few labelled
samples are available in the target domain [50]. Again,
such important information can be exploited by
means of a novel regularization term to be integrated
in the original optimal transport formulation. This
regularization term is designed such that samples
in the target domain should only be matched with
samples in the source domain that have the same
labels. It can be expressed as:

$$\Omega_{semi}(\gamma) = \langle \gamma, M \rangle, \tag{21}$$

where $M$ is an $n_s \times n_t$ cost matrix, with $M(i,j) = 0$
whenever $y_i^s = y_j^t$ (or $j$ is a sample with unknown
label) and $+\infty$ otherwise. This term has the benefit
of being parameter free. It boils down to changing the
original cost function $C$, defined in Equation (8), by
adding an infinite cost to undesired matches. Smooth
versions of this regularization can be devised, for
instance, by using a probabilistic confidence that target
sample $x_j^t$ belongs to class $y_j^t$. Though appealing,
we have not explored this latter option in this work.
Note also that the Laplacian strategy in
Equation (20) can leverage these class labels
in the target domain through the definition of matrix
$S_t$.
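The semi-supervised term amounts to masking the cost matrix, which can be sketched in a few lines. In the sketch below, the function name `semi_supervised_cost`, the convention that `-1` marks an unlabeled target sample, and the use of a large finite constant `big` in place of an infinite cost are all assumptions made for illustration.

```python
import numpy as np

def semi_supervised_cost(C, y_s, y_t, big=1e10):
    """Return C + M, with M = 0 on allowed pairs and `big` on forbidden ones.

    A pair (i, j) is forbidden when target j is labeled (label != -1)
    and its label differs from that of source i.
    """
    y_s = np.asarray(y_s)[:, None]          # shape (n_s, 1)
    y_t = np.asarray(y_t)[None, :]          # shape (1, n_t)
    forbidden = (y_t != -1) & (y_s != y_t)  # broadcast to (n_s, n_t)
    return C + np.where(forbidden, big, 0.0)

C = np.ones((3, 3))
y_s = [0, 1, 1]
y_t = [0, -1, 1]   # the second target sample is unlabeled
C_semi = semi_supervised_cost(C, y_s, y_t)
```

A finite `big` is used here only because it is convenient with exponentiated kernels; any value large enough to make the corresponding couplings negligible plays the same role.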

5 Generalized conditional gradient for solving regularized OT problems

In this section, we discuss an efficient algorithm for
solving optimization problem (16), that can be used
with any of the proposed regularizers.

Firstly, we characterize the existence of a solution
to the problem. We remark that the regularizers given
in Equations (17) and (18) are continuous, thus the
objective function is continuous. Moreover, since the
constraint set $\mathcal{B}$ is a convex, closed and bounded
(hence compact) subset of $\mathbb{R}^{n_s \times n_t}$, the objective function
reaches its minimum on $\mathcal{B}$. In addition, if the
regularizer is strictly convex, that minimum is unique. This
occurs, for instance, for the Laplacian regularization in
Equation (18).

Now, let us discuss algorithms for computing the
optimal transport solution of problem (16). For solving
a similar problem with a Laplacian regularization
term, Ferradans et al. [20] used a conditional gradient
(CG) algorithm [4]. This approach is appealing and
could be extended to our problem. It is an iterative
scheme that guarantees that every iterate belongs to $\mathcal{B}$,
meaning that each iterate is a transportation
plan. At each iteration, in order to find a
feasible search direction, a CG algorithm looks for a
minimizer of the objective function's linear
approximation. Hence, at each iteration it solves a Linear
Program (LP) that is presumably easier to handle than
the original regularized optimal transport problem.
Nevertheless, despite the existence of efficient LP
solvers such as CPLEX or MOSEK, the dimensionality
of this LP makes it hardly
tractable, since it involves $n_s \times n_t$ variables.

In this work, we aim for a more scalable algorithm.
To this end, we consider an approach based on a
generalization of the conditional gradient algorithm [7]
denoted as generalized conditional gradient (GCG).

The framework of the GCG algorithm addresses the
general case of constrained minimization of composite
functions defined as

$$\min_{\gamma \in \mathcal{B}} f(\gamma) + g(\gamma), \tag{22}$$

where $f(\cdot)$ is a differentiable and possibly non-convex
function; $g(\cdot)$ is a convex, possibly non-differentiable
function; $\mathcal{B}$ denotes any convex and compact subset
of $\mathbb{R}^n$. As illustrated in Algorithm 1, all the steps
of the GCG algorithm are exactly the same as those
used for CG, except for the search direction part (Line
3). The difference is that GCG linearizes only part
$f(\cdot)$ of the composite objective function, instead of
the full objective function. This approach is justified
when the resulting nonlinear optimization problem
can be efficiently solved. The GCG algorithm has been
shown by Bredies et al. [8] to converge towards a
stationary point of Problem (22). In our case, since $g(\gamma)$
is differentiable, stronger convergence results can be
provided (see supplementary material for a discussion
on convergence rate and duality gap monitoring).

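More specifically, for problem (16) we can set
$f(\gamma) = \langle \gamma, C \rangle_F + \eta\,\Omega_c(\gamma)$ and $g(\gamma) = \lambda\,\Omega_s(\gamma)$.
Supposing now that $\Omega_c(\gamma)$ is differentiable, step 3 of
Algorithm 1 boils down to

$$\gamma^\star = \operatorname*{argmin}_{\gamma \in \mathcal{B}} \; \langle \gamma, C + \eta \nabla\Omega_c(\gamma^k) \rangle_F + \lambda\,\Omega_s(\gamma).$$

Interestingly, this problem is an entropy-regularized
optimal transport problem similar to Problem (9) and
can be efficiently solved using the Sinkhorn-Knopp
scaling matrix approach.

Algorithm 1 Generalized Conditional Gradient

1: Initialize $k = 0$ and $\gamma^0 \in \mathcal{B}$
2: repeat
3:   With $G \in \nabla f(\gamma^k)$, solve
       $\gamma^\star = \operatorname*{argmin}_{\gamma \in \mathcal{B}} \langle \gamma, G \rangle_F + g(\gamma)$
4:   Find the optimal step $\alpha^k$
       $\alpha^k = \operatorname*{argmin}_{0 \leq \alpha \leq 1} f(\gamma^k + \alpha\Delta\gamma) + g(\gamma^k + \alpha\Delta\gamma)$,
     with $\Delta\gamma = \gamma^\star - \gamma^k$
5:   $\gamma^{k+1} \leftarrow \gamma^k + \alpha^k \Delta\gamma$, set $k \leftarrow k + 1$
6: until Convergence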

In our optimal transport problem, $\Omega_c(\gamma)$ is instantiated
by the Laplacian or the group-lasso regularization
term. The former is differentiable, whereas
the group-lasso is not when there exists a class $cl$
and an index $j$ for which $\gamma(\mathcal{I}_{cl}, j)$ is a vector of zeros.
However, one can note that if the iterate $\gamma^k$ is such
that $\gamma^k(\mathcal{I}_{cl}, j) \neq 0$ $\forall cl, \forall j$, then the same property
holds for $\gamma^{k+1}$. This is due to the exponentiation
occurring in the Sinkhorn-Knopp algorithm used for
the entropy-regularized optimal transport problem.
This means that if we initialize $\gamma^0$ so that $\gamma^0(\mathcal{I}_{cl}, j) \neq 0$,
then $\Omega_c(\gamma^k)$ is always differentiable. Hence, our
GCG algorithm can also be applied to the group-lasso
regularization, despite its non-differentiability at 0.
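The following is a minimal sketch of the generalized conditional gradient loop with the group-lasso regularizer, not the authors' implementation. It assumes that the entropic term is $\Omega_s(\gamma) = \sum_{i,j} \gamma(i,j) \log \gamma(i,j)$, replaces the exact line search of step 4 by a coarse grid search over $[0, 1]$, and uses a fixed number of iterations instead of a convergence test. The names `sinkhorn`, `group_lasso` and `gcg_group_lasso` and the default parameter values are hypothetical.

```python
import numpy as np

def sinkhorn(a, b, G, lam, n_iter=500):
    """Entropic OT: argmin <gamma, G> + lam * sum(gamma * log(gamma))."""
    K = np.exp(-(G - G.min()) / lam)   # a constant shift of G only rescales K
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)              # match column marginals
        u = a / (K @ v)                # match row marginals
    return u[:, None] * K * v[None, :]

def group_lasso(gamma, y_s, eps=1e-16):
    """Value and gradient of sum_j sum_c ||gamma[I_c, j]||_2."""
    value, grad = 0.0, np.zeros_like(gamma)
    for c in np.unique(y_s):
        idx = y_s == c
        norms = np.linalg.norm(gamma[idx], axis=0)
        value += norms.sum()
        grad[idx] = gamma[idx] / (norms + eps)
    return value, grad

def gcg_group_lasso(a, b, C, y_s, lam=0.1, eta=0.1, n_outer=10):
    def objective(g):
        entropy = (g * np.log(np.maximum(g, 1e-300))).sum()
        return (g * C).sum() + eta * group_lasso(g, y_s)[0] + lam * entropy

    gamma = np.outer(a, b)             # strictly positive, feasible start
    history = [objective(gamma)]
    alphas = np.linspace(0.0, 1.0, 21)
    for _ in range(n_outer):
        # Step 3: linearize f only, keep the entropic term, solve by Sinkhorn.
        G = C + eta * group_lasso(gamma, y_s)[1]
        delta = sinkhorn(a, b, G, lam) - gamma
        # Step 4: grid line search (alpha = 0 keeps the current iterate).
        alpha = min(alphas, key=lambda t: objective(gamma + t * delta))
        # Step 5: move along the feasible direction.
        gamma = gamma + alpha * delta
        history.append(objective(gamma))
    return gamma, history
```

Because the starting plan `np.outer(a, b)` is strictly positive and each Sinkhorn solution is a product of exponentials, every iterate stays strictly positive in this sketch, so the group-lasso gradient is well defined throughout; the small `eps` is only a numerical safeguard.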

6 Numerical experiments 

In this section, we study the behavior of four different
versions of optimal transport applied to the DA
problem. In the rest of the section, OT-exact is the
original transport problem (8), OT-IT the information-theoretic
regularized one (9), and the two proposed
class-based regularized ones are denoted OT-GL and
OT-Laplace, corresponding respectively to the group-lasso
(Equation (17)) and Laplacian (Equation (18))
regularization terms. We also present some results
with our previous class-label based regularizer built
upon an $\ell_p - \ell_1$ norm: OT-LpL1 [14].

6.1 Two moons: simulated problem with controllable complexity

In the first experiment, we consider the same toy
example as in [22]. The simulated dataset consists
of two domains: for the source, the standard two
entangled moons data, where each moon is associated
with a specific class (see Figure 3(a)). The target domain
is built by applying a rotation to the two moons,
which yields an adaptation problem whose
difficulty increases with the rotation
angle. This example is notably interesting because
the corresponding problem is clearly non-linear, and
because the input dimensionality is small (2), which
leads to poor performance when applying methods
based on subspace alignment (e.g. [23], [34]).

TABLE 1: Mean error rate over 10 realizations for the
two moons simulated example.

Target rotation angle   10°     20°     30°     40°     50°     70°     90°
SVM (no adapt.)         0       0.104   0.24    0.312   0.4     0.764   0.828
DASVM [9]               0       0       0.259   0.284   0.334   0.747   0.82
PBDA [22]               0       0.094   0.103   0.225   0.412   0.626   0.687
OT-exact                0       0.028   0.065   0.109   0.206   0.394   0.507
OT-IT                   0       0.007   0.054   0.102   0.221   0.398   0.508
OT-GL                   0       0       0       0.013   0.196   0.378   0.508
OT-Laplace              0       0       0.004   0.062   0.201   0.402   0.524

We follow the same experimental protocol as in [22],
thus allowing for a direct comparison with the
state-of-the-art results presented therein. The source domain
is composed of two moons of 150 samples each.
The target domain is also sampled from these two
shapes, with the same number of examples. Then, the
generalization capability of our method is tested over
a set of 1000 samples that follow the same distribution
as the target domain. The experiments are conducted
10 times, and we consider the mean classification error
as the comparison criterion. As a classifier, we used an
SVM with a Gaussian kernel, whose parameters were
set by 5-fold cross-validation. We compare the adaptation
results with two state-of-the-art methods: the
DASVM approach [9] and the more recent PBDA [22],
which has been shown to provide competitive results on
this dataset.
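A dataset of this kind can be generated in a few lines. The sketch below is an illustration under assumptions: the exact moon geometry, the noise level and the centre of rotation used in [22] are not specified here, so the half-circle parametrization, `noise=0.05` and rotation about the origin are hypothetical choices, as are the function names.

```python
import numpy as np

def two_moons(n_per_class, rng, noise=0.05):
    """Two interleaved half-circles, one per class (geometry is an assumption)."""
    t = rng.uniform(0.0, np.pi, n_per_class)
    upper = np.c_[np.cos(t), np.sin(t)]               # class 0
    lower = np.c_[1.0 - np.cos(t), 0.5 - np.sin(t)]   # class 1
    X = np.vstack([upper, lower])
    X = X + noise * rng.normal(size=X.shape)
    y = np.repeat([0, 1], n_per_class)
    return X, y

def rotate(X, degrees):
    """Rotate 2D points counter-clockwise about the origin."""
    theta = np.deg2rad(degrees)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return X @ R.T

rng = np.random.default_rng(0)
X_s, y_s = two_moons(150, rng)      # source: two moons of 150 samples each
X_t, y_t = two_moons(150, rng)      # fresh draw from the same shapes
X_t = rotate(X_t, 40.0)             # target: rotated by 40 degrees
```

The target labels `y_t` are kept only to score the adapted classifier; they would not be used during unsupervised adaptation.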

Results are reported in Table 1. Our first observation
is that all the methods based on optimal transport
behave better than the state-of-the-art methods, in
particular for low rotation angles, where results indicate
that the geometrical structure is better preserved
through the adaptation by optimal transport. For
large angles (e.g. 90°), the final score is also significantly
better than that of the other state-of-the-art methods, but
falls down to a 0.5 error rate, which is natural since in
this configuration a transformation of -90°, implying
an inversion of labels, would have led to similar
empirical distributions. This clearly shows the capacity of
our method to handle large domain transformations.
Adding the class-label information into the regularization
also clearly helps for the mid-range angle values,
where the adaptation shows nearly optimal results up
to angles $\leq$ 40°. For the strongest deformations ($\geq$ 70°
rotation), no clear winner among the OT methods can
be found. We think that, regardless of the amount and
type of regularization chosen, the classification of test
samples becomes too dependent on the training
samples. These mostly come from the denser part
of $\mu_s$ and, as a consequence, the less dense parts of this
PDF are not satisfactorily transported. This behavior
can be seen in Figure 3(d).
Fig. 3: Illustration of the classification decision boundary produced by OT-Laplace over the two moons example
for increasing rotation angles: (a) source domain, (b) rotation = 20°, (c) rotation = 40°, (d) rotation = 90°.
The source domain is represented as coloured points. The target domain is
depicted as points in grey (best viewed with colors).


6.2 Visual adaptation datasets 

We now evaluate our method on three challenging
real-world vision adaptation tasks, which have
attracted a lot of interest in recent computer vision
literature [39]. We start by presenting the datasets, then
the experimental protocol, and finish by providing
and discussing the results obtained.

6.2.1 Datasets 

Three types of image recognition problems are
considered: digits, faces and miscellaneous objects
recognition. This choice of datasets was already featured
in [34]. A summary of the properties of each domain
considered in the three problems is provided in
Table 2. An illustration of some examples of the different
domains for a particular class is shown in Figure 4.

Digit recognition. As source and target domains, we
use the two digits datasets USPS and MNIST, which
share 10 classes of digits (single digits 0-9). We
randomly sampled 1,800 images from USPS and 2,000 images
from MNIST. The MNIST images are resized to the
same resolution as that of USPS (16 × 16). The grey
levels of all images are then normalized to obtain a
final common feature space for both domains.

Face recognition. In the face recognition experiment,
we use the PIE ("Pose, Illumination, Expression")
dataset, which contains 32 × 32 images of 68
individuals taken under various pose, illumination and
expression conditions. The 4 experimental domains
are constructed by selecting 4 distinct poses: PIE05
(C05, left pose), PIE07 (C07, upward pose), PIE09
(C09, downward pose) and PIE29 (C29, right pose).
This allows us to define 12 different adaptation problems
with increasing difficulty (the most challenging being
the adaptation from right to left poses). Let us note
that each domain has a strong variability for each
class due to illumination and expression variations.

Object recognition. We used the Caltech-Office
dataset [42], [24], [23], [54], [39]. The dataset contains
images coming from four different domains: Amazon
(online merchant), the Caltech-256 image collection
[25], Webcam (images taken from a webcam) and
DSLR (images taken from a high resolution digital
SLR camera). The variability of the different domains
comes from several factors: presence/absence of
background, lighting conditions, noise, etc. We consider
two feature sets:

• SURF descriptors as described in [42], used to
transform each image into an 800-bin histogram.
These histograms are subsequently normalized
and reduced to standard scores.

• two DeCAF deep learning feature sets [19]: these
features are extracted as the sparse activation of
the neurons from the fully connected 6th and
7th layers of a convolutional network trained
on ImageNet and then fine-tuned on the visual
recognition tasks considered here. As such, they
form vectors with 4096 dimensions.

TABLE 2: Summary of the domains used in the visual
adaptation experiment

Problem   Domains   Dataset   # Samples   # Features   # Classes   Abbr.
Digits    USPS      USPS      1800        256          10          U
Digits    MNIST     MNIST     2000        256          10          M
Faces     PIE05     PIE       3332        1024         68          P1
Faces     PIE07     PIE       1629        1024         68          P2
Faces     PIE09     PIE       1632        1024         68          P3
Faces     PIE29     PIE       1632        1024         68          P4
Objects   Caltech   Caltech   1123        800 | 4096   10          C
Objects   Amazon    Office    958         800 | 4096   10          A
Objects   Webcam    Office    295         800 | 4096   10          W
Objects   DSLR      Office    157         800 | 4096   10          D

6.2.2 Experimental setup 

Following [23], the classification is conducted using
a 1-Nearest Neighbor (1NN) classifier, which has the
advantage of being parameter free. In all experiments,
1NN is trained with the adapted source data, and
evaluated over the target data to provide a classification
accuracy score. We compare our optimal transport
solutions to the following baseline methods that
are particularly well adapted for image classification:

• 1NN is the original classifier without adaptation
and constitutes a baseline for all experiments;

• PCA, which consists in applying a projection
on the first principal components of the joint
source/target distribution (estimated from the
concatenation of source and target samples);

• GFK, Geodesic Flow Kernel [23];

• TSL, Transfer Subspace Learning [44], which
operates by minimizing the Bregman divergence
between the domains embedded in lower dimensional
spaces;

• JDA, Joint Distribution Adaptation [34], which
extends the Transfer Component Analysis
algorithm [38].

Fig. 4: Examples from the datasets used in the visual
adaptation experiment. 5 random samples from one
class are given for all the considered domains.

In unsupervised DA, no target labels are available.
As a consequence, it is impossible to consider a
cross-validation step for the hyper-parameters of the different
methods. However, in order to compare the
methods fairly, we adopt the following protocol. For
each source domain, a random selection of 20 samples
per class (with the only exception of 8 for the DSLR
dataset) is adopted. Then the target domain is equally
partitioned into a validation set and a test set. The
validation set is used to obtain the best accuracy in the
range of the possible hyper-parameters. The accuracy,
measured as the percentage of correct classifications over
all the classes, is then evaluated on the test set,
with the best selected hyper-parameters. This strategy
normally prevents overfitting on the test set. The
experiment is conducted 10 times, and the mean
accuracy over all these realizations is reported.
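The validation protocol above can be sketched as follows. This is a hedged illustration: `adapt` is a hypothetical callable standing in for any adaptation method (it takes the source samples, the unlabeled target samples and one hyper-parameter value, and returns adapted source samples), and whether adaptation uses the whole target set or only one half is an assumption made here for brevity.

```python
import numpy as np

def one_nn_accuracy(X_train, y_train, X_test, y_test):
    """Accuracy (in %) of a 1-nearest-neighbor classifier."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    pred = y_train[d2.argmin(axis=1)]
    return 100.0 * (pred == y_test).mean()

def select_and_test(adapt, grid, X_s, y_s, X_t, y_t, rng):
    """Pick the hyper-parameter on one half of the target, score on the other."""
    perm = rng.permutation(len(X_t))
    val, test = perm[: len(perm) // 2], perm[len(perm) // 2:]
    scores = [one_nn_accuracy(adapt(X_s, X_t, p), y_s, X_t[val], y_t[val])
              for p in grid]
    best = grid[int(np.argmax(scores))]
    test_acc = one_nn_accuracy(adapt(X_s, X_t, best), y_s, X_t[test], y_t[test])
    return best, test_acc

# Toy check: the target is the mirrored source, and the "adaptation"
# simply multiplies the source by the hyper-parameter p.
X_s = np.array([[-1.0], [-1.2], [1.0], [1.2]])
y_s = np.array([0, 0, 1, 1])
X_t, y_t = -X_s, y_s.copy()
best, acc = select_and_test(lambda Xs, Xt, p: p * Xs, [1.0, -1.0],
                            X_s, y_s, X_t, y_t, np.random.default_rng(0))
```

In the toy check, `p = 1.0` misclassifies every target sample and `p = -1.0` classifies all of them correctly, so the selection step picks `-1.0` whatever the random split.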

We considered the following parameter ranges:
for subspace learning methods (PCA, TSL, GFK, and
JDA) we considered reduced $k$-dimensional spaces
with $k \in \{10, 20, \ldots, 70\}$. A linear kernel was
chosen for all the methods with a kernel formulation.
For all methods requiring a regularization
parameter, the best value was searched in $\lambda \in
\{0.001, 0.01, 0.1, 1, 10, 100, 1000\}$. The $\lambda$ and $\eta$
parameters of our different regularizers (Equation (16)) are
validated using the same search interval. In the case
of the Laplacian regularization (OT-Laplace), $S_t$ is
a binary matrix which encodes a nearest-neighbors
graph with 8-connectivity. For the source domain,
$S_s$ is filtered such that connections between elements
of different classes are pruned. Finally, we set the $\alpha$
value in Equation (20) to 0.5.
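One way to build such similarity matrices is sketched below. It reads "8-connectivity" as linking each sample to its 8 nearest neighbors and symmetrizes the graph by taking the union of edges; both readings are assumptions, as are the function names `knn_similarity` and `laplacian`.

```python
import numpy as np

def knn_similarity(X, k=8, labels=None):
    """Binary, symmetric k-nearest-neighbor graph with optional class pruning."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # a point is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]           # indices of the k closest points
    S = np.zeros((len(X), len(X)))
    S[np.arange(len(X))[:, None], nn] = 1.0
    S = np.maximum(S, S.T)                       # symmetrize (union of edges)
    if labels is not None:
        S = S * (labels[:, None] == labels[None, :])  # drop cross-class edges
    return S

def laplacian(S):
    """Graph Laplacian diag(S 1) - S."""
    return np.diag(S.sum(axis=1)) - S

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.array([0] * 10 + [1] * 10)
S_t = knn_similarity(X, k=3)               # target-style graph: no labels used
S_s = knn_similarity(X, k=3, labels=y)     # source-style graph: class-pruned
L_s = laplacian(S_s)
```

The pruned matrix `S_s` plays the role of the filtered source similarity, while `S_t` is left untouched because target labels are not available in the unsupervised setting.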

6.2.3 Results on unsupervised domain adaptation 

Results of the experiment are reported in Table 3,
where the best performing method for each domain
adaptation problem is highlighted in bold. On
average, all the OT-based domain adaptation methods
perform better than the baseline methods, except in
the case of the PIE dataset, where JDA outperforms
the OT-based methods in 7 out of 12 domain pairs.
A possible explanation is that the dataset contains a
lot of classes (68), and the EM-like step of JDA, which
takes into account the current results of classification
on the target, clearly leads to a benefit. We
notice that TSL, which is based on a similar principle
of distribution divergence minimization, almost never
outperforms our regularized strategies, except on pair
A→C. Among the different optimal transport strategies,
OT-exact leads to the lowest performance. OT-IT,
the entropy-regularized version of the transport, is
substantially better than OT-exact, but is still inferior
to the class-based regularized strategies proposed in
this paper. The best performing strategies are clearly
OT-GL and OT-Laplace, with a slight advantage for
OT-GL. OT-LpL1, which is based on a similar
regularization strategy as OT-GL, but with a different
optimization scheme, has globally inferior performance,
except on some pairs of domains (e.g. C→A) where
it achieves better scores. On both digits and objects
recognition tasks, OT-GL significantly outperforms
the baseline methods.

In the next experiment (Table 4), we use the same
experimental protocol on different features produced
by the DeCAF deep learning architecture [19]. We
report the results of the experiment conducted on the
Office-Caltech dataset, with the OT-IT and OT-GL
regularization strategies. For comparison purposes,
JDA is also considered for this adaptation task. The
results show that, even though the deep learning
features naturally yield a strong improvement over
the classical SURF features, the proposed OT methods
are still capable of significantly improving the
performance of the final classification (by
more than 20 points in some cases, e.g. D→A or A→W). This
clearly shows how OT has the capacity to handle
non-stationarity in the distributions that the deep
architecture has difficulty handling. We also note that using
the features from the 7th layer instead of the 6th does
not bring a strong improvement in the classification
accuracy, suggesting that part of the work of the 7th
layer is already performed by the optimal transport.

6.2.4 Semi-supervised domain adaptation 

In this last experiment, we assume that few labels are 
available in the target domain. We thus benchmark 









TABLE 3: Overall recognition accuracies in % obtained over all domain pairs using the SURF features.
Maximum values for each pair are indicated in bold font.

Domains   1NN     PCA     GFK     TSL     JDA     OT-exact   OT-IT   OT-Laplace   OT-LpLq   OT-GL
U→M       39.00   37.83   44.16   40.66   54.52   50.67      53.66   57.42        60.15     57.85
M→U       58.33   48.05   60.96   53.79   60.09   49.26      64.73   64.72        68.07     69.96
mean      48.66   42.94   52.56   47.22   57.30   49.96      59.20   61.07        64.11     63.90

P1→P2     23.79   32.61   22.83   34.29   67.15   52.27      57.73   58.92        59.28     59.41
P1→P3     23.50   38.96   23.24   33.53   56.96   51.36      57.43   57.62        58.49     58.73
P1→P4     15.69   30.82   16.73   26.85   40.44   40.53      47.21   47.54        47.29     48.36
P2→P1     24.27   35.69   24.18   33.73   63.73   56.05      60.21   62.74        62.61     61.91
P2→P3     44.45   40.87   44.03   38.35   68.42   59.15      63.24   64.29        62.71     64.36
P2→P4     25.86   29.83   25.49   26.21   49.85   46.73      51.48   53.52        50.42     52.68
P3→P1     20.95   32.01   20.79   39.79   60.88   54.24      57.50   57.87        58.96     57.91
P3→P2     40.17   38.09   40.70   39.17   65.07   59.08      63.61   65.75        64.04     64.67
P3→P4     26.16   36.65   25.91   36.88   52.44   48.25      52.33   54.02        52.81     52.83
P4→P1     18.14   29.82   20.11   40.81   46.91   43.21      45.15   45.67        46.51     45.73
P4→P2     24.37   29.47   23.34   37.50   55.12   46.76      50.71   52.50        50.90     51.31
P4→P3     27.30   39.74   26.42   46.14   53.33   48.05      52.10   52.71        51.37     52.60
mean      26.22   34.55   26.15   36.10   56.69   50.47      54.89   56.10        55.45     55.88

C→A       20.54   35.17   35.29   45.25   40.73   30.54      37.75   38.96        48.21     44.17
C→W       18.94   28.48   31.72   37.35   33.44   23.77      31.32   31.13        38.61     38.94
C→D       19.62   33.75   35.62   39.25   39.75   26.62      34.50   36.88        39.62     44.50
A→C       22.25   32.78   32.87   38.46   33.99   29.43      31.65   33.12        35.99     34.57
A→W       23.51   29.34   32.05   35.70   36.03   25.56      30.40   30.33        35.63     37.02
A→D       20.38   26.88   30.12   32.62   32.62   25.50      27.88   27.75        36.38     38.88
W→C       19.29   26.95   27.75   29.02   31.81   25.87      31.63   31.37        33.44     35.98
W→A       23.19   28.92   33.35   34.94   31.48   27.40      37.79   37.17        37.33     39.35
W→D       53.62   79.75   79.25   80.50   84.25   76.50      80.00   80.62        81.38     84.00
D→C       23.97   29.72   29.50   31.03   29.84   27.30      29.88   31.10        31.65     32.38
D→A       27.10   30.67   32.98   36.67   32.85   29.08      32.77   33.06        37.06     37.17
D→W       51.26   71.79   69.67   77.48   80.00   65.70      72.52   76.16        74.97     81.06
mean      28.47   37.98   39.21   42.97   44.34   36.69      42.30   43.20        46.42     47.70

TABLE 4: Results of adaptation by optimal transport
using DeCAF features.

          Layer 6                           Layer 7
Domains   DeCAF   JDA     OT-IT   OT-GL     DeCAF   JDA     OT-IT   OT-GL
C→A       79.25   88.04   88.69   92.08     85.27   89.63   91.56   92.15
C→W       48.61   79.60   75.17   84.17     65.23   79.80   82.19   83.84
C→D       62.75   84.12   83.38   87.25     75.38   85.00   85.00   85.38
A→C       64.66   81.28   81.65   85.51     72.80   82.59   84.22   87.16
A→W       51.39   80.33   78.94   83.05     63.64   83.05   81.52   84.50
A→D       60.38   86.25   85.88   85.00     75.25   85.50   86.62   85.25
W→C       58.17   81.97   74.80   81.45     69.17   79.84   81.74   83.71
W→A       61.15   90.19   80.96   90.62     72.96   90.94   88.31   91.98
W→D       97.50   98.88   95.62   96.25     98.50   98.88   98.38   91.38
D→C       52.13   81.13   77.71   84.11     65.23   81.21   82.02   84.93
D→A       60.71   91.31   87.15   92.31     75.46   91.92   92.15   92.92
D→W       85.70   97.48   93.77   96.29     92.25   97.02   96.62   94.17
mean      65.20   86.72   83.64   88.18     75.93   87.11   87.53   88.11

our semi-supervised approach on SURF features
extracted from the Office-Caltech dataset. We consider
that only 3 labeled samples per class are at our
disposal in the target domain. In order to disentangle
the benefits of the labeled target samples brought by
our optimal transport strategies from those brought
by the classifier, we make a distinction between two
cases: in the first one, denoted as "Unsupervised +
labels", we consider that the labeled target samples are
available only at the learning stage, after an
unsupervised domain adaptation with optimal transport.
In the second case, denoted as "Semi-supervised",
labels in the target domain are used to compute a new
transportation plan, through the use of the proposed
semi-supervised regularization term in Equation (21).

TABLE 5: Results of semi-supervised adaptation with
optimal transport using the SURF features.

          Unsupervised + labels       Semi-supervised
Domains   OT-IT        OT-GL          OT-IT        OT-GL        MMDT [28]
C→A       37.0 ± 0.5   41.4 ± 0.5     46.9 ± 3.4   47.9 ± 3.1   49.4 ± 0.8
C→W       28.5 ± 0.7   37.4 ± 1.1     64.8 ± 3.0   65.0 ± 3.1   63.8 ± 1.1
C→D       35.1 ± 1.7   44.0 ± 1.9     59.3 ± 2.5   61.0 ± 2.1   56.5 ± 0.9
A→C       32.3 ± 0.1   36.7 ± 0.2     36.0 ± 1.3   37.1 ± 1.1   36.4 ± 0.8
A→W       29.5 ± 0.8   37.8 ± 1.1     63.7 ± 2.4   64.6 ± 1.9   64.6 ± 1.2
A→D       36.9 ± 1.5   46.2 ± 2.0     57.6 ± 2.5   59.1 ± 2.3   56.7 ± 1.3
W→C       35.8 ± 0.2   36.5 ± 0.2     38.4 ± 1.5   38.8 ± 1.2   32.2 ± 0.8
W→A       39.6 ± 0.3   41.9 ± 0.4     47.2 ± 2.5   47.3 ± 2.5   47.7 ± 0.9
W→D       77.1 ± 1.8   80.2 ± 1.6     79.0 ± 2.8   79.4 ± 2.8   67.0 ± 1.1
D→C       32.7 ± 0.3   34.7 ± 0.3     35.5 ± 2.1   36.8 ± 1.5   34.1 ± 1.5
D→A       34.7 ± 0.3   37.7 ± 0.3     45.8 ± 2.6   46.3 ± 2.5   46.9 ± 1.0
D→W       81.9 ± 0.6   84.5 ± 0.4     83.9 ± 1.4   84.0 ± 1.5   74.1 ± 0.8
mean      41.8         46.6           54.8         55.6         52.5

Results are reported in Table 5. They clearly show
the benefits of the proposed semi-supervised
regularization term in the definition of the transportation
plan. A comparison with the state-of-the-art method
of Hoffman and colleagues [28] is also reported, and
shows the competitiveness of our approach.

7 Conclusion 

In this paper, we described a new framework based on
optimal transport to solve the unsupervised domain
adaptation problem. We proposed two regularization
schemes to encode class structure in the source
domain during the estimation of the transportation
plan, thus enforcing the intuition that samples of
the same class must undergo a similar transformation.
We extended this regularized OT framework to the
semi-supervised domain adaptation case, i.e. the case
where few labels are available in the target domain.
Regarding the computational aspects, we suggested
using a modified version of the conditional gradient
algorithm, the generalized conditional gradient
splitting, which enables the method to scale up to
real-world datasets. Finally, we applied the proposed
methods to both synthetic and real-world datasets.
Results show that the optimal transportation domain
adaptation schemes frequently outperform the
competing state-of-the-art methods.

We believe that the framework presented in this paper
will lead to a paradigm shift for the domain adaptation
problem. Estimating a transport is much more
general than finding a common subspace, but comes
with the problem of finding a proper regularization
term. The proposed class-based or Laplacian regularizers
show very good performance, but we believe
that other types of regularizers should be investigated.
Indeed, whenever the transformation is induced by
a physical process, one may want the transport map
to enforce physical constraints. This can be included
with dedicated regularization terms. We also plan to
extend our optimal transport framework to the
multi-domain adaptation problem, where the problem of
matching several distributions can be cast as a
multi-marginal optimal transport problem.